feat: add native support for get_json_object expression #3747
Draft
andygrove wants to merge 6 commits into apache:main from
Implement the Spark GetJsonObject expression natively using serde_json for JSON parsing and a custom JSONPath evaluator supporting field access, array indexing, bracket notation, and wildcards. Closes apache#3162
Mark as Incompatible since Spark's Jackson parser allows single-quoted JSON and unescaped control characters which serde_json does not support. Add allowIncompatible config to SQL test file.
- Enable serde_json preserve_order feature to maintain JSON key ordering
- Fix wildcard to only work on arrays (not objects), matching Spark
- Fix single wildcard match to preserve JSON string quoting
- Add user-facing docs in expressions.md
- Add more SQL tests: object wildcard, single match, missing fields, invalid paths, field names with special chars, key ordering
- Add Rust unit tests for new edge cases
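As a rough sketch of how these semantics compose — in Python for brevity, with hypothetical names (the actual implementation is Rust in get_json_object.rs) — a parsed path can be evaluated over a decoded JSON tree like this:

```python
import json

# Toy model of a parsed JSONPath: a str segment is a field access,
# an int segment is an array index, and WILDCARD stands for [*].
WILDCARD = object()

def eval_path(doc, segments):
    """Walk segments over a decoded JSON tree; None maps to SQL NULL.

    The real implementation serializes matches back to JSON text (which
    is where string quoting for a single wildcard match matters); this
    sketch just returns the decoded values.
    """
    current = [doc]
    for seg in segments:
        nxt = []
        for node in current:
            if seg is WILDCARD:
                # Wildcard applies only to arrays, matching Spark; on an
                # object it matches nothing.
                if isinstance(node, list):
                    nxt.extend(node)
            elif isinstance(seg, str) and isinstance(node, dict) and seg in node:
                nxt.append(node[seg])
            elif isinstance(seg, int) and isinstance(node, list) and 0 <= seg < len(node):
                nxt.append(node[seg])
        current = nxt
    if not current:
        return None
    return current[0] if len(current) == 1 else current

doc = json.loads('{"name": "a", "items": [{"id": 1}, {"id": 2}]}')
```

For example, `eval_path(doc, ["items", WILDCARD, "id"])` yields `[1, 2]`, while a wildcard applied directly to the root object yields `None`, mirroring the arrays-only wildcard fix above.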
Benchmarks simple field, numeric field, nested field, array element, and nested object extraction with 1M rows of JSON data.
- Move StringBuilder import to top-level imports
- Fix doc comment to not mention object wildcard (unsupported)
- Pre-compute has_wildcard in ParsedPath struct (avoids per-row scan)
- Split evaluation into evaluate_no_wildcard (returns Option, zero Vec allocations) and evaluate_with_wildcard (returns Vec)
- Simplify single wildcard match: serialize directly instead of array-wrap-then-strip-brackets hack
- Add comment in Cargo.toml explaining preserve_order requirement
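The preserve_order requirement mentioned above comes down to Cargo feature unification: enabling the feature in one crate enables it for every crate in the workspace that depends on serde_json. A sketch with a hypothetical crate layout:

```toml
# native/spark-expr/Cargo.toml
[dependencies]
# preserve_order backs serde_json's Map with an insertion-ordered map so
# object keys round-trip in input order, matching Spark's Jackson output.
serde_json = { version = "1", features = ["preserve_order"] }

# some-other-crate/Cargo.toml (hypothetical sibling crate in the workspace)
[dependencies]
serde_json = "1"  # also built with preserve_order via feature unification
```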
Which issue does this PR close?
Closes #3162.
Rationale for this change
`get_json_object` is a widely-used Spark function for extracting values from JSON strings using JSONPath expressions. Without native support, queries using this function fall back to Spark's JVM execution. This PR adds an initial native implementation to allow Comet to accelerate these queries.

This is a starting point. The expression is marked `Incompatible` and is disabled by default. Users must set `spark.comet.expression.GetJsonObject.allowIncompatible=true` to enable it.

What changes are included in this PR?
- Rust implementation (`native/spark-expr/src/string_funcs/get_json_object.rs`): supports `$` (root), `.field`, `['field']` (bracket notation), `[n]` (array index), and `[*]` (array wildcard); uses `serde_json` with the `preserve_order` feature for Spark-compatible key ordering
- Scala serde (`spark/src/main/scala/org/apache/comet/serde/strings.scala`): `CometGetJsonObject` with `getSupportLevel` returning `Incompatible` (Spark's Jackson parser allows single-quoted JSON and unescaped control characters that `serde_json` does not)
- Registration and wiring: added to the `stringExpressions` map in `QueryPlanSerde.scala` and registered in `comet_scalar_funcs.rs` via `scalarFunctionExprToProtoWithReturnType`
- SQL tests (`get_json_object.sql`): 30 test queries covering field extraction, nested objects, arrays, wildcards, nulls, invalid JSON, bracket notation, and edge cases
- Docs: updated `expressions.md` and `spark_expressions_support.md`

Current performance
Benchmarked with 1M rows of JSON (~200 bytes each) on Apple M3 Ultra:
Benchmarked query patterns: `$.name` (simple field), `$.age` (numeric field), `$.address.city` (nested field), `$.items[0]` (array element), and `$.address` (nested object).

Comet is currently ~10% slower than Spark. The primary reason is that `serde_json` parses the full JSON document into a DOM tree on every row, while Spark's Jackson-based implementation uses a streaming parser that can skip irrelevant fields without allocating.

Known limitations and future work
This is an initial implementation. Known gaps that could be addressed in follow-up PRs:
- Performance: replace `serde_json::from_str` (full DOM parse) with a streaming approach (e.g., `jiter` or a custom `serde_json::Deserializer` with `IgnoredAny`) to skip irrelevant JSON content without allocating. This would likely close the performance gap with Spark.
- `$.*` on arrays: Spark distinguishes `$.*` (object wildcard, using a `Wildcard` token) from `$[*]` (array wildcard, using `Subscript::Wildcard`). Our parser treats both as the same `Wildcard` segment. Currently `$.*` on arrays returns values in Comet but null in Spark.
- Nested array flattening: `$[*][*]` triggers `FlattenStyle`, which flattens nested arrays. Our implementation doesn't handle this special case.
- Wrapping behavior: for paths like `$.arr[0][*].field`, Spark's `WriteStyle` state machine may produce different wrapping behavior than our count-based approach.
- `preserve_order` is workspace-wide: Cargo unifies features, so enabling `preserve_order` on `serde_json` in `spark-expr` also enables it for all other crates in the workspace. Could be addressed by isolating the JSON parsing behind a feature flag.

How are these changes tested?
- SQL file tests (`CometSqlFileTestSuite`) that run each query through both Spark and Comet and compare results, with dictionary encoding on/off
- A benchmark (`CometGetJsonObjectBenchmark`) comparing Spark vs Comet performance across 5 query patterns
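One item in the future-work list above is replacing the full DOM parse with a streaming scan. As a rough illustration of why streaming can win — in Python for brevity, and as a toy sketch rather than anything in this PR — a scanner can find one top-level field in the raw text and stop early, instead of decoding the whole document into a tree the way `serde_json::from_str` (or `json.loads`) does. It assumes valid JSON and extracts only string or bare scalar values of top-level keys; a real implementation would use a streaming parser such as `jiter`.

```python
def scan_top_level_field(doc: str, key: str):
    """Return the raw value of a top-level field, scanning without
    building a DOM. Assumes valid JSON; strings and bare scalars only."""
    i, n, depth = 0, len(doc), 0
    while i < n:
        c = doc[i]
        if c == '"':
            # consume a string token, honoring backslash escapes
            j = i + 1
            while doc[j] != '"':
                j += 2 if doc[j] == '\\' else 1
            token = doc[i + 1:j]
            i = j + 1
            # a string at depth 1 followed by ':' is a top-level key
            k = i
            while k < n and doc[k] in ' \t\r\n':
                k += 1
            if depth == 1 and k < n and doc[k] == ':' and token == key:
                k += 1
                while doc[k] in ' \t\r\n':
                    k += 1
                if doc[k] == '"':  # string value
                    j = k + 1
                    while doc[j] != '"':
                        j += 2 if doc[j] == '\\' else 1
                    return doc[k + 1:j]
                j = k  # bare scalar: read until a delimiter
                while j < n and doc[j] not in ',}]':
                    j += 1
                return doc[k:j].strip()
        elif c in '{[':
            depth += 1
            i += 1
        elif c in '}]':
            depth -= 1
            i += 1
        else:
            i += 1
    return None
```

The scan returns as soon as the requested key's value is read, and nested objects are skipped by depth counting alone, with no per-node allocation — which is the property that makes Jackson-style streaming faster than a per-row DOM parse.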